Abstract

In this paper we quantify the consistency of word usage in written texts represented by 
complex networks, where words were taken as nodes, by measuring the degree of preser- 
vation of the node neighborhood. Words were considered highly consistent if the authors 
used them with the same neighborhood. When ranked according to the consistency of 
use, the words obeyed a log-normal distribution, in contrast to Zipf's law, which applies
to the frequency of use. Consistency correlated positively with the familiarity and
frequency of use, and negatively with ambiguity and age of acquisition. An inspection 
of some highly consistent words confirmed that they are used in very limited semantic 
contexts. A comparison of consistency indices for 8 authors indicated that these indices 
may be employed for author recognition. Indeed, as expected, authors of novels could
be distinguished from those who wrote scientific texts. Our analysis demonstrated the 
suitability of the consistency indices, which can now be applied in other tasks, such as 
emotion recognition. 

1 Introduction 

Since the dawn of humanity, the ability to communicate has proven to be an essential
factor for the preservation of life and for the maintenance of social relations. Writing, a major
manifestation of communication, whose invention dates back to 3200 B.C., also established itself as
one of the key skills developed by mankind. Among its main advantages over the spoken lan- 
guage are the joint capacity of portability and permanence, which ensure that thoughts, ideas, 
facts and stories are preserved. Despite this ubiquity, the writing skill cannot be considered a 
trivial or ordinary task. Even after language acquisition, the construction of a concise, precise 
and well concatenated text requires organized thought, the ability to use expressive linguistic 
resources and the analytical interpretation of reality. 

In addition to the difficulties imposed by the grammar mechanisms in written language p], 
there is a factor related to the semantic level of detail to recreate and interpret the author's 
original idea. Since it is not possible to specify all the details in a finite piece of text, the 
author must always focus on the desired level of generality. Thus, if little detail is provided 
the reader must fill in the blanks using his/her own semantic knowledge and experience about 
the world. On the other hand, for an excessively detailed text there is no room for inferences,
which makes reading more objective. This dichotomy between objectivity and subjectivity has 
been explored in various ways in different genres of writing. While scientific texts, newspapers, 
magazines and reports tend to use a more objective approach, literary and artistic texts tend 
to present themselves subjectively. In both cases, the degree of objectivity (or subjectivity) 
varies depending on both the text size (long texts tend to be more detailed) and the number of
descriptive words (a text with many adjectives, for instance, tends to be very detailed). Equally
important seems to be the context induced by words [2, 3], since words inducing restricted contexts
somehow limit the ability to extrapolate ideas, and this makes the text more objective. 

With this correlation between induction and objectivity as motivation [1], in this paper we 
address the problem of quantifying the level of consistency inherent in words, i.e. the degree of 
preservation of their neighbors, to understand the reasons why a word is used in a more or less 
consistent way. We use the term consistency because words whose neighbors are preserved will 
tend to be used in the same, consistent way by different authors in distinct types of text. Using 
the concepts and methodologies from complex networks [H El E] to analyze the relationship 
between concepts, we developed a series of indices to measure consistency. These indices are 
based on the idea that if a word induces a limited set of contexts, then the neighborhood of 
that word tends to be maintained even in texts written by different writers. In fact, this seems 
a reasonable assumption, since it is known that syntactically related words also tend to be 
semantically related. With the indices created using this methodology, we shall show that the 
distribution of consistency does not follow a power law [HI E] , unlike the case of the frequency 
of words (Zipf's law) [10]. Instead, the distribution seems to follow a log-normal distribution.
Furthermore, the greater the familiarity, the number of distinct neighbors and frequency in 
the language, the more consistent is the word. As for the semantic factors, we showed that 
consistent words tend to preserve not only the lexical neighborhood, but also the semantic 
context. Finally, we show how the quantification of consistency can be useful in tasks such as 
those related to authorship recognition. 

2 Methodology 

2.1 Dataset 

The distinct contexts in which words are used were investigated with a database comprising
several books from the Gutenberg project repository (see footnote 1), whose list appears in Table 1. Although
the number of books differs among authors, the corpus for each author has a fixed
size (180,500 tokens). Thus, differences in corpus size have little influence on the
analysis of the consistency of words.

2.2 Modeling Texts as Complex Networks 

Language issues have attracted the interest of many researchers in recent years [Til [T2| [13j 
HU [TSJ [T71 HH]. For instance, physicists have used dynamical systems P3H 120] and complex 
networks [2I1I221I2SIIMIES1I2SIEZIEH] to study various aspects of language. Many of these 
studies have used the co-occurrence model [271 EH EH] to link adjacent words in the text. The 

1 http://www.gutenberg.org/
Table 1: Database employed in the experiments.

Author: Arthur Conan Doyle
    Uncle Bernac - A Memory of the Empire
    The Tragedy of the Korosko
    The Valley of Fear
    The War in South Africa
    The White Company
    Through the Magic Door
    The Adventures of Sherlock Holmes

Author: Bram Stoker
    Dracula's Guest
    The Jewel of Seven Stars
    The Lady of the Shroud
    Lair of the White Worm
    The Man

Author: Charles Darwin
    Coral Reefs
    On the Origin of Species by Means of Natural Selection
    The Voyage of the Beagle
    The Different Forms of Flowers on Plants of the Same Species

Author: Charles Dickens
    American Notes
    A Tale of Two Cities
    Hard Times
    The Old Curiosity Shop

Author: Thomas Hardy
    A Changed Man; and Other Tales
    Desperate Remedies
    Far from the Madding Crowd
    The Hand of Ethelberta

Author: Pelham Grenville Wodehouse
    My Man Jeeves
    Tales of St. Austin's
    The Adventures of Sally
    The Clicking of Cuthbert
    The Gem Collector
    The Man with Two Left Feet
    The Pothunters
    The Swoop!
    The White Feather

Author: Virginia Woolf
    Jacob's Room
    Monday or Tuesday
    Night and Day
    The Voyage Out

Author: William Wordsworth
    Lyrical Ballads, with Other Poems - Volume 1
    The Poetical Works of William Wordsworth - Volume 1-3
Table 2: The pre-processing step involves two procedures: (i) Removal of stopwords and (ii)
Lemmatization of the remaining words.

Original:
    It was not that he felt any emotion akin to love for Irene Adler.
    All emotions, and that one particularly, were abhorrent to his
    cold, precise but admirably balanced mind.

Without stopwords:
    felt emotion akin love Irene Adler
    emotions abhorrent cold precise balanced mind

Lemmatized:
    feel emotion akin love Irene Adler
    emotion abhorrent cold precise balanced mind


idea behind this modeling comes from the use of co-occurrence statistics at various scales,
from bigram statistics to discourse-scale windows, which has been widespread in document
analysis and retrieval for at least two decades [30, 31, 32, 33]. Because we are interested in the
neighborhood properties of words to examine the preservation of induced contexts, we chose to
use this model, which is described as follows.

The modeling procedure started with a pre-processing step, where stopwords (i.e., words 
with little semantic meaning), such as articles and prepositions, were removed from the text. 
Although the frequency of such words may be useful in distinguishing writers' personal char- 
acteristics [34] we decided to ignore them because we are only interested in the contextual 
semantic preservation. The remaining words were then lemmatized so that conjugated verbs 
and nouns in the plural form were converted respectively to their infinitive and singular forms. 
Thus words related to the same concept but with distinct inflections were taken as a single 
node in the network. This lemmatization was performed with the MXPOST part-of-speech
tagger [35], based on a maximum entropy model [36]. The tagger was used to resolve ambiguities
during the conversion to the canonical form (infinitive and singular). After this preprocessing,
each distinct word became a node and the neighborhood relationship between words defined
the set of edges. To illustrate the procedure, we created a small network for the following
text extract: "It was not that he felt any emotion akin to love for Irene Adler. All emotions,
and that one particularly, were abhorrent to his cold, precise but admirably balanced mind",
obtained from the book The Adventures of Sherlock Holmes, by Arthur Conan Doyle. Table 2
summarizes the pre-processing steps and Figure 1 displays the resulting network.
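
The construction of the co-occurrence network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_cooccurrence_network is hypothetical, the input is assumed to be already free of stopwords and lemmatized, and adjacent words are linked even across sentence boundaries, a detail the text does not specify.

```python
def build_cooccurrence_network(tokens):
    """Return an undirected network as a dict: word -> set of adjacent words.

    Assumption: `tokens` is the pre-processed text (stopwords removed,
    words lemmatized) and every pair of adjacent tokens is linked.
    """
    neighbors = {}
    for left, right in zip(tokens, tokens[1:]):
        if left == right:
            continue  # skip self-loops caused by immediate repetitions
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)
    return neighbors


# Lemmatized extract taken from Table 2.
tokens = ("feel emotion akin love Irene Adler "
          "emotion abhorrent cold precise balanced mind").split()
network = build_cooccurrence_network(tokens)
print(sorted(network["emotion"]))
```

In this sketch the word "emotion" occurs twice in the extract, so its node accumulates the neighbors of both occurrences.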

The networks for each of the books of an author were then joined to obtain the so-called
author's network, reflecting the association of words generated by that author. That is to say,
if a given node appears in any of the book networks, then it will also appear in the author's
network. Similarly, two vertices were connected in the author's network if they appeared
connected in at least one of the book networks. The derivation process is illustrated in Figure 2.
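
The joining of book networks into an author's network amounts to a union of nodes and edges. A minimal sketch follows, assuming each book network is stored as a dictionary mapping a word to the set of its neighbors; the name merge_book_networks and the toy data are hypothetical.

```python
def merge_book_networks(book_networks):
    """Union of several book networks (dict: word -> set of neighbors)."""
    author_network = {}
    for network in book_networks:
        for word, nbrs in network.items():
            # A node or edge present in any book is kept for the author.
            author_network.setdefault(word, set()).update(nbrs)
    return author_network


# Toy book networks, invented for illustration only.
book_1 = {"sleep": {"night"}, "night": {"sleep"}}
book_2 = {"sleep": {"dream"}, "dream": {"sleep"}}
author_network = merge_book_networks([book_1, book_2])
print(sorted(author_network["sleep"]))
```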

2.3 Consistency Indices 

In this section we describe the indices proposed to measure the consistency C with which words 
are used. Because obtaining C demands that the neighborhood of each word for each author is 
known, it was calculated only for the 2,880 words that appeared in the networks for all authors.
Figure 1: Example of network obtained for the extract: "It was not that he felt any emotion
akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his 
cold, precise but admirably balanced mind" of the book The Adventures of Sherlock Holmes, by 
Arthur Conan Doyle. 




[Figure 2 panels: Book 1, Book 2, Book 3, Author's network]

Figure 2: Example of derivation of a specific network for an author from the networks of books.
Note that if a neighbor appears in at least one of the networks of books, it will also appear in
the author's network.
2.3.1 C Based on the Histogram of Co-occurrence of Neighbors

To derive this index for a given word w, consider the vector v^w storing the frequency of each
of the k(w) distinct neighbors of w over all authors' networks. Hence, v_i^w stores the number of
authors for which a given neighbor i appeared connected to the word w. Each element v_i^w
ranges from 1 to 8, since n = 8 is the number of authors studied. The consistency based on
v^w increases with the sum of the vector components, since a high v_i^w means that the corresponding
association of words is repeated by several authors. The consistency index is calculated as:

C_{hist}(w) = \frac{s(w) - k(w)}{(n - 1)\,k(w)} = \frac{s(w) - k(w)}{7\,k(w)},    (1)

where s is given by:

s(w) = \sum_{i=1}^{k(w)} v_i^w.    (2)

Note that while the numerator in eq. (1) represents the deviation of the sum of v_i^w from the minimum
possible sum (which occurs when each of the k neighbors occurs in only one of the authors'
networks), the denominator is the range of the sum (the largest sum occurs when every neighbor
appears in all n networks). Consequently, C_hist ranges between 0 and 1.
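
Eqs. (1) and (2) translate directly into code. The sketch below assumes that, for the word w, the neighborhood in each author's network is available as a Python set; the function name hist_consistency and the toy data are hypothetical.

```python
from collections import Counter


def hist_consistency(neighbor_sets):
    """C_hist for one word, given its set of neighbors in each author's network."""
    n = len(neighbor_sets)                      # number of authors (needs n >= 2)
    # counts[v] is v_i^w: the number of authors linking neighbor v to the word.
    counts = Counter(v for nbrs in neighbor_sets for v in nbrs)
    k = len(counts)                             # k(w): distinct neighbors (needs k > 0)
    s = sum(counts.values())                    # s(w), eq. (2)
    return (s - k) / ((n - 1) * k)              # eq. (1)


# Toy neighborhoods for two authors, invented for illustration.
c_hist = hist_consistency([{"night", "dream"}, {"night", "bed"}])
print(c_hist)
```

With two authors sharing one of three distinct neighbors, s = 4 and k = 3, so the index is 1/3; identical neighborhoods give 1 and disjoint ones give 0.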

2.3.2 C Based on the Cosine Similarity 

Consistency may be calculated for words considering two distinct authors' networks A^i and
A^j comprising the sets of nodes denoted by V_i and V_j, respectively. Initially, V_i and V_j were
expanded (keeping the original links) so that the new set of vertices is V = V_i ∪ V_j. Let w be
the word whose consistency is being calculated. The number of neighbors of w appearing in
both A^i and A^j is:

\mathcal{M}_{ij}(w) = \sum_k A^i_{wk} A^j_{wk}.    (3)

In spite of capturing the number of neighbors in common, it is difficult to interpret whether the
value of \mathcal{M}_{ij} is high or low, since it is not normalized. We have therefore normalized equation
(3), dividing it by the geometric mean of the degrees of the word w in both networks. Thus, if k^i_w
and k^j_w, given by

k^i_w = \sum_x A^i_{wx},    (4)

represent the degrees of the word w respectively in A^i and A^j, then the normalized number of
shared neighbors is:

\mathcal{N}_{ij}(w) = \frac{\sum_k A^i_{wk} A^j_{wk}}{\sqrt{k^i_w k^j_w}}.    (5)

In order to take into account all pairs of authors in computing the consistency, we defined
consistency as the average of \mathcal{N}_{ij} over the n(n-1)/2 pairs of distinct networks:

C_{cos}(w) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathcal{N}_{ij}(w).    (6)
It is also worth mentioning that C_cos ranges between 0 and 1, analogously to \mathcal{N}_{ij}. Actually,
C_cos can be interpreted as the cosine of the angle between the vectors whose elements indicate
the presence or absence of a particular word as neighbor. Thus, if these vectors are similar, the
angle between them is small and the cosine is high, indicating high consistency.
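
Because the adjacency vectors are binary, the number of shared neighbors is simply the size of a set intersection. The following sketch computes the cosine-based index as the average over all pairs of authors; it assumes every author's neighborhood of w is a non-empty set, and the name cosine_consistency is hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_consistency(neighbor_sets):
    """C_cos for one word: mean normalized overlap over all author pairs."""
    pairs = list(combinations(neighbor_sets, 2))    # the n(n-1)/2 distinct pairs
    total = 0.0
    for nbrs_i, nbrs_j in pairs:
        shared = len(nbrs_i & nbrs_j)               # M_ij, shared neighbors
        total += shared / sqrt(len(nbrs_i) * len(nbrs_j))  # N_ij
    return total / len(pairs)


# Toy neighborhoods for three authors, invented for illustration.
c_cos = cosine_consistency([{"night", "dream"}, {"night", "bed"}, {"night", "dream"}])
print(c_cos)
```

For the toy data the three pairwise values are 0.5, 1 and 0.5, so the average is 2/3.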

2.3.3 C Based on the Pearson Correlation Coefficient 

One can estimate whether \mathcal{M}_{ij} in eq. (3) is high by comparing it with the expected number
of common neighbors assuming a random choice of neighborhood. The expected number of common
neighbors for word w in the networks A^i and A^j is determined by k^i_w and k^j_w, as defined in equation
(4). If w in the network A^i randomly chooses one of its k^i_w neighbors, then the probability of
choosing a node which is also a neighbor of w in A^j would be k^j_w/N. Applying this reasoning
for the k^i_w neighbors of w in A^i, the expected number of common neighbors is k^i_w k^j_w/N, where
N represents the number of nodes in the network. Therefore, if \mathcal{M}_{ij} > k^i_w k^j_w/N, consistency
is higher than expected. Thus, defining consistency as the difference between the actual and
expected numbers of neighbors in common, we obtain C_r:

C_r(w) = \sum_k A^i_{wk} A^j_{wk} - \frac{k^i_w k^j_w}{N}
       = \sum_k A^i_{wk} A^j_{wk} - N \bar{k}^i_w \bar{k}^j_w
       = \sum_k (A^i_{wk} - \bar{k}^i_w)(A^j_{wk} - \bar{k}^j_w),    (7)

where the notation \bar{k} represents the degree normalized by the number of nodes in the network:

\bar{k}^i_w = \frac{1}{N} \sum_x A^i_{wx}.    (8)

Eq. (7) can be interpreted as a covariance, i.e., a non-normalized correlation. To transform
C_r into a normalized measurement, C_r is divided by the corresponding standard deviations of
the vectors A^i_{wk} and A^j_{wk}, k = 1, ..., N. The covariance then becomes the Pearson correlation
coefficient r_{ij}:

C'_r(w) = r_{ij} = \frac{\sum_k (A^i_{wk} - \bar{k}^i_w)(A^j_{wk} - \bar{k}^j_w)}{\sqrt{\sum_k (A^i_{wk} - \bar{k}^i_w)^2}\,\sqrt{\sum_k (A^j_{wk} - \bar{k}^j_w)^2}}.    (9)

With consistency defined as in eq. (9), C'_r ranges between -1 and 1. In order to restrict the
range between 0 and 1, the following linear transformation was applied to C'_r, deriving
C''_r:

C''_r = \frac{C'_r + 1}{2}.    (10)
Now, if C''_r > 0.5, then the number of shared neighbors is greater than expected by
chance.
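
For binary adjacency vectors the sums in the Pearson coefficient have closed forms, which makes the index easy to sketch for one pair of authors. The code below is an illustration rather than the authors' implementation: the name pearson_consistency is hypothetical, neighborhoods are assumed to be sets, and the identity sum_k (A_wk - kbar)^2 = k - k^2/N (valid only for 0/1 entries) is a simplification introduced here.

```python
from math import sqrt


def pearson_consistency(nbrs_i, nbrs_j, num_nodes):
    """Rescaled Pearson index for one word and one pair of author networks.

    Assumes 0 < len(nbrs) < num_nodes for both neighborhoods, so that
    neither binary adjacency vector is constant.
    """
    k_i, k_j = len(nbrs_i), len(nbrs_j)
    shared = len(nbrs_i & nbrs_j)                 # observed common neighbors
    cov = shared - k_i * k_j / num_nodes          # observed minus expected
    # For 0/1 vectors: sum_k (A_wk - kbar)^2 = k - k^2 / N.
    ss_i = k_i - k_i ** 2 / num_nodes
    ss_j = k_j - k_j ** 2 / num_nodes
    r = cov / sqrt(ss_i * ss_j)                   # Pearson coefficient in [-1, 1]
    return (r + 1) / 2                            # rescaled to [0, 1]


same = pearson_consistency({"night", "dream"}, {"night", "dream"}, num_nodes=10)
opposite = pearson_consistency({"night"}, {"dream"}, num_nodes=2)
print(same, opposite)
```

Identical neighborhoods yield a value of 1, while two complementary neighborhoods in a two-node universe yield 0.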

2.3.4 C Based on the Leicht-Holme-Newman Index [37]

In the consistency index based on the Pearson correlation coefficient, C_r is derived from the
difference between the number of common neighbors found and the number of common neighbors
expected by chance. For the Leicht-Holme-Newman Index [37], the new measure of consistency
is obtained from the ratio between these numbers, as follows:

C_{LHN}(w) = \frac{\sum_k A^i_{wk} A^j_{wk}}{k^i_w k^j_w / N} = N \frac{\sum_k A^i_{wk} A^j_{wk}}{\sum_k A^i_{wk} \sum_k A^j_{wk}}.    (11)

The important threshold now is the number 1, since for C_LHN above 1, the consistency is
higher than expected.
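
A minimal sketch of the ratio-based index for one pair of authors follows, under the same assumptions as before (neighborhoods stored as non-empty sets, hypothetical function name lhn_consistency, toy data).

```python
def lhn_consistency(nbrs_i, nbrs_j, num_nodes):
    """Ratio between observed and expected numbers of common neighbors."""
    shared = len(nbrs_i & nbrs_j)                         # observed
    expected = len(nbrs_i) * len(nbrs_j) / num_nodes      # k_i * k_j / N
    return shared / expected


# One shared neighbor out of two per author, in a network with 8 nodes.
c_lhn = lhn_consistency({"night", "dream"}, {"night", "bed"}, num_nodes=8)
print(c_lhn)
```

Here the expected overlap is 2 * 2 / 8 = 0.5, so a single shared neighbor already gives a ratio of 2, above the threshold of 1.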

2.3.5 C Based on the Sorensen Index [38] 

Unlike the metrics described above, this measure is based on the dissimilarity of the sets of
neighbors of a given word in two networks. The vectors A^i_w and A^j_w, which store the neighborhood
of the word w, are used again, with the dissimilarity of neighbors being computed as the
corresponding squared normalized Euclidean distance:

C_{euc}(w) = \frac{\sum_k (A^i_{wk} - A^j_{wk})^2}{k^i_w + k^j_w}
           = \frac{\sum_k (A^i_{wk} + A^j_{wk} - 2 A^i_{wk} A^j_{wk})}{k^i_w + k^j_w}
           = 1 - 2\,\frac{\sum_k A^i_{wk} A^j_{wk}}{k^i_w + k^j_w}.    (12)

Since the distance between the vectors was divided by the maximum possible distance (given
by k^i_w + k^j_w), C_euc ranges between 0 and 1. To make this measure consistent with other metrics,
we converted it from a dissimilarity to a similarity measure with the following transformation:

C_s(w) = 1 - C_{euc}(w) = \frac{2 \mathcal{M}_{ij}}{k^i_w + k^j_w}.    (13)
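
The equivalence between the normalized Euclidean distance and the Sorensen similarity can be checked numerically. In the sketch below the squared distance between two binary vectors equals the size of the symmetric difference of the neighbor sets; the function names are hypothetical and the neighborhoods are assumed to be non-empty sets.

```python
def euclidean_dissimilarity(nbrs_i, nbrs_j):
    """Squared Euclidean distance between binary vectors, normalized by k_i + k_j."""
    return len(nbrs_i ^ nbrs_j) / (len(nbrs_i) + len(nbrs_j))


def sorensen_consistency(nbrs_i, nbrs_j):
    """Sorensen similarity: twice the overlap divided by the sum of degrees."""
    return 2 * len(nbrs_i & nbrs_j) / (len(nbrs_i) + len(nbrs_j))


a, b = {"night", "dream"}, {"night", "bed"}
c_euc = euclidean_dissimilarity(a, b)
c_s = sorensen_consistency(a, b)
print(c_euc, c_s)
```

For the toy sets both quantities equal 0.5, and in general the two functions add up to 1.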
2.3.6 C based on the Frequency of the Shared Neighbors 

In this index, we also consider the frequency with which a given neighbor was employed by 
each author. This new measure is justified because neighbors may appear with quite different 
frequencies. For example, if a given association of words occurs 100 times for an author and 
only once for another author (considering texts of the same size), the consistency index would 
still be maximum according to the indices described so far. However, this combination of words 
is clearly not consistent, since the frequencies are quite different. 

The disparity in frequency is considered as follows. Suppose that word w has N distinct
neighbors in the corpus of n authors. Let v_k be one of the neighbors of w. If v_k appears
associated to w f^i_k times in A^i and f^j_k times in A^j, then the consistency of w regarding the
association w ↔ v_k is:

C^{ij}(w ↔ v_k) = 1 - \frac{|f^i_k - f^j_k|}{f^i_k + f^j_k}.    (14)

To consider all the neighbors, C_f(w) is computed as an average over C^{ij}(w ↔ v_k), provided
that f^i_k + f^j_k > 0. Assuming that this condition occurs ξ times, C_f(w) is:

C_f(w) = \frac{1}{\xi} \sum_{k:\, f^i_k + f^j_k > 0} C^{ij}(w ↔ v_k).    (15)
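
The frequency-based index can be sketched for one pair of authors as follows. The code assumes that the per-association consistency has the form 1 - |f_i - f_j| / (f_i + f_j) and that the co-occurrence counts are stored in dictionaries; the name frequency_consistency and the counts are hypothetical.

```python
def frequency_consistency(freq_i, freq_j):
    """C_f for one word and one pair of authors.

    freq_i, freq_j: dict mapping a neighbor to the number of times it
    co-occurs with the word in each author's corpus.
    """
    values = []
    for neighbor in set(freq_i) | set(freq_j):
        f_i = freq_i.get(neighbor, 0)
        f_j = freq_j.get(neighbor, 0)
        if f_i + f_j > 0:                     # only neighbors used at least once
            values.append(1 - abs(f_i - f_j) / (f_i + f_j))
    return sum(values) / len(values)


# "night" is used 100 times by one author and once by the other.
c_f = frequency_consistency({"night": 100, "dream": 3}, {"night": 1, "dream": 3})
print(c_f)
```

The unbalanced association contributes only 2/101, whereas the balanced one contributes 1, which illustrates why this index penalizes associations that the binary indices would treat as fully consistent.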



3 Results and Discussion 

3.1 Analysis of distribution and intercorrelation of consistency indices



The consistency indices defined in Section 2.3 were first used to examine the distribution of
consistency for the 2,880 words under analysis. The distributions display a peak and two
asymmetric tails for all the indices used, as illustrated in Figure 3 for the Sorensen index and
the indices based on the frequency of words in common and on histograms. One infers that
it is rare for a word to take extreme consistency values (low or high consistency), but it is far
rarer for a word to take very high consistency values. Formally, this finding is confirmed
by the log-normal distribution that is obeyed by the indices. In all cases, the data could be
explained by a log-normal function

f(x; x_c, \sigma, A) = \frac{A}{\sqrt{2\pi}\,\sigma x} \exp\left(-\frac{[\ln(x/x_c)]^2}{2\sigma^2}\right),    (16)

where x_c, σ and A correspond to the free parameters of the distribution. Table 3 summarizes
the parameters for each case, and confirms the suitability of the fitting with squared Pearson
coefficient R^2 ≈ 1 and chi-square χ^2 ≪ 1.

Log-normal distributions are largely found in non-linguistic contexts (see e.g. [39, 40, 41,
42, 43, 44]), but they are less common in linguistics, for which statistical distributions such as
Zipf's law [10] prevail, as in the case of frequency and ranking of words [8]. Nevertheless,
log-normal distributions have been reported for random variables in natural language issues. 
For example, Williams [13] showed that the length of sentences seems to follow a log-normal 
distribution. Similarly, Herdan [IB] found that the length of spoken words in phone conversation 
also follows this distribution. Log-normal distributions are usually generated by processes 
following proportionality laws [13 HH1 SHJ- For this reason, the consistency can be thought 
as a result of a growth process governed by the constant α_k, which also follows a log-normal
distribution. To verify why this statement is true, suppose that a given word is initially used
by only 2 authors with consistency C_0. For each new writer who uses this word, the current
consistency increases or decreases according to the factor α_k:

C_k = \alpha_k C_{k-1}.    (17)
[Figure 3 panels: consistency index based on the Sorensen index; consistency index based on
the frequency of common words; consistency index based on histograms. Horizontal axis: consistency.]

Figure 3: Distribution of consistency for 2,880 words. Regardless of the definition employed to
quantify consistency, a similar distribution was obtained.



Table 3: Parameters obtained from fitting the consistency probability density functions. All
distributions follow a log-normal distribution, since R^2 ≈ 1 and χ^2 ≪ 1.

C           x_c                 σ                    A                    R^2     χ^2
Frequency   0.02416 ± 0.0011    0.97641 ± 0.03091    0.00542 ± 0.00014    0.991   1.53 · 10^-5
Histogram   0.02389 ± 0.0018    0.89291 ± 0.04733    0.00342 ± 0.00019    0.972   2.68 · 10^-5
Sorensen    0.05076 ± 0.0020    1.04088 ± 0.02737    0.01055 ± 0.00022    0.994   8.25 · 10^-6
Pearson     0.04134 ± 0.0037    1.11074 ± 0.04754    0.00683 ± 0.00036    0.986   1.49 · 10^-6
Cosine      0.06411 ± 0.0052    1.08812 ± 0.04677    0.01232 ± 0.00058    0.989   1.52 · 10^-5
LHN         3.97112 ± 0.0536    0.65551 ± 0.01073    0.90414 ± 0.01281    0.994   1.17 · 10^-5


After n + 2 authors use the word, the final consistency C_n will be given by:

C_n = C_0 \prod_{i=1}^{n} \alpha_i.    (18)

Actually, the consistency of the word is quantified as a percentage of the current consistency
with each new use and this percentage is independent of the consistency currently observed.
Since we assume that α_k is log-normally distributed, C_k will also follow a log-normal
distribution, because the product of log-normal variables is also log-normally
distributed [49]. The requirement that α_k follows a log-normal distribution can be disregarded if
we consider that many authors use the word. Because ln C_n is given by:

\ln C_n = \sum_{i=1}^{n} \ln \alpha_i + \ln C_0    (19)

and since, according to the central limit theorem, \sum_{i=1}^{n} \ln \alpha_i follows a normal distribution, it is
possible to state that ln C_n is normally distributed and then C_n follows a log-normal distribution.
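
The central-limit argument can be illustrated with a small simulation that uses only the Python standard library. All numerical choices below are assumptions made for illustration: the factors are drawn uniformly between 0.5 and 1.5 (deliberately not log-normal), the initial consistency is 0.5, and 50 multiplicative updates are applied; the function name is hypothetical.

```python
import math
import random
import statistics


def simulate_log_consistency(n_updates, c0, rng):
    """Return ln C_n after n_updates multiplicative updates C_k = a_k * C_{k-1}."""
    log_c = math.log(c0)
    for _ in range(n_updates):
        log_c += math.log(rng.uniform(0.5, 1.5))   # factor a_k, not log-normal
    return log_c


rng = random.Random(42)                            # fixed seed for repeatability
samples = [simulate_log_consistency(50, 0.5, rng) for _ in range(2000)]

mean = statistics.mean(samples)
std = statistics.pstdev(samples)
# Sample skewness of ln C_n; a normal distribution has skewness 0.
skewness = sum(((x - mean) / std) ** 3 for x in samples) / len(samples)
print(round(mean, 2), round(std, 2), round(skewness, 2))
```

Even though the individual factors are far from log-normal, the skewness of ln C_n should come out close to zero, which is what one expects if C_n is approximately log-normally distributed.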

The similar behavior for the consistency indices in Figure 3 may mean that the indices are
correlated. Indeed, the Pearson correlation coefficients (r) [50] between pairs of indices were
all close to 1 (results not shown), with the exception of the LHN index. The correlation ranged
from 0.902 for C_r and C_hist to 0.996 for C_cos and C_s. Hence, 5 of the indices are equivalent and
Table 4: Pearson (r) and Spearman (ρ) correlations between LHN and other consistency
indices. In both cases, the correlations were low.

C           r       ρ
Histogram   0.044   0.166
Pearson     0.318   0.347
Cosine      0.301   0.315
Sorensen    0.293   0.295




can be used interchangeably. As for LHN, it is weakly correlated with the other 5 indices, as
shown in the first column of Table 4. Even when the correlation of ranks is used, low values
were observed, as shown in the second column of Table 4, which summarizes the values of
Spearman's rank correlation coefficient (ρ) [51]. Therefore, for quantifying consistency in texts
in future works, it suffices to consider the measures based on the difference and on the ratio
between the number of common neighbors and the expected number of neighbors by chance.
For the sake of completeness, however, we chose to analyze all the indices in the next sections.

3.2 Analysis of Correlation Between Consistency and Linguistic Features

To understand how consistency measured according to the indices proposed is related to other 
linguistic factors, we examined the relationship between consistency and the number of distinct 
neighbors of the word. The scatter plots for 4 consistency indices are given in Figure 4, indicating
a strong positive correlation between the number of distinct neighbors and consistency
for all the indices (the indices not shown in Figure 4 exhibit the same behavior). Words with
larger number of neighbors in the networks tend to be more consistent. It is possible that words 
with many neighbors (probably frequently used) are more consistent because they are used in 
a more uniform fashion as they are widely employed by different writers. Conversely, words 
with fewer neighbors are probably less frequent words that are more prone to the influence of 
the writer's personal experiences. 

Since the distribution of the number of neighbors for the 2,880 words analyzed is very broad,
we divided them into four classes (C_A, C_B, C_C and C_D) according to the number of neighbors,
as depicted in Table 5. The correlation between consistency and the following features was
calculated (see footnote 2): (i) age of acquisition (i.e., the age at which a child begins to use the word as part
of his/her spoken vocabulary), (ii) familiarity, (iii) imaginability, (iv) frequency of use in the
English language; and (v) ambiguity (i.e., the number of distinct meanings extracted from
WordNet [53]). The correlations for each class and the corresponding consistency index that provided
the strongest correlations are shown in Table 6.

The only correlations above 0.3 occurred for familiarity in classes C_C and
C_D (0.37 and 0.43, respectively), with the other correlations being below 0.3 in magnitude. Notwithstanding, interesting trends can be
inferred from the sign of the correlations. For example, the negative correlation with the age 
of acquisition suggests that less consistent words take longer to be learned. This should be 
expected since a heterogeneous use of weakly consistent words makes them more difficult to 
be acquired. An analogous reasoning applies to the familiarity and imaginability (ability to 
visualize a concept), as familiar words tend to induce the same concepts (i.e., well-known words 

2 The quantities (i)-(iv) were obtained from the MRC Psycholinguistic Database [52].
[Figure 4 panels: scatter plots of consistency versus number of distinct neighbours, on
logarithmic axes, for four indices, including the indices based on the frequency, on the
Sorensen index (Pearson: 0.805) and on the histogram (Pearson: 0.766); a further panel
label reads Pearson: 0.797.]

Figure 4: Correlation between the number of distinct neighbors and the corresponding
consistency. Regardless of the consistency index, there is strong correlation between consistency and
the number of neighbors.

Table 5: Number of words T in each group and range of the number of distinct neighbors
η. While class A comprises words with few distinct neighbors (up to 150), class D comprises
words with many distinct neighbors.

Class     η                 T
Class A   η ≤ 150           1,070
Class B   151 ≤ η ≤ 300     791
Class C   301 ≤ η ≤ 500     463
Class D   501 ≤ η ≤ 800     256
Table 6: Correlation obtained by comparing linguistic features and consistency of words for
classes C_A, C_B, C_C and C_D.

Linguistic Measure    Correlation for C_A   Consistency Index
Age of acquisition    -0.09                 Histogram
Familiarity           +0.18                 LHN
Imaginability         +0.07                 Pearson
Frequency             +0.12                 Sorensen
Ambiguity             -0.13                 LHN

Linguistic Measure    Correlation for C_B   Consistency Index
Age of acquisition    -0.15                 Histogram
Familiarity           +0.27                 Histogram
Imaginability         +0.13                 LHN
Frequency             +0.13                 Sorensen
Ambiguity             -0.13                 LHN

Linguistic Measure    Correlation for C_C   Consistency Index
Age of acquisition    -0.24                 Pearson
Familiarity           +0.37                 Histogram
Imaginability         +0.14                 Sorensen
Frequency             +0.03                 Cosine
Ambiguity             -0.14                 LHN

Linguistic Measure    Correlation for C_D   Consistency Index
Age of acquisition    -0.20                 Sorensen
Familiarity           +0.43                 Sorensen
Imaginability         +0.07                 Histogram
Frequency             +0.20                 Sorensen
Ambiguity             -0.20                 LHN
Table 7: Ratio between the number of neighbors of the authors' networks which are at a
distance d from the central concept in the EAT network and the expected number of neighbors
at a distance d assuming random associations. Interestingly, the immediate neighborhood of
the most consistent words is also concentrated in the immediate neighborhood (d = 1) of the
EAT networks. It turns out that consistent words induce specific contexts.

Word      d=1     d=2    d=3    d=4
sleep     15.74   0.66   0.00   0.18
silence   20.95   2.82   0.30   0.36
happy     19.92   1.75   0.05   0.00
hair      26.14   1.31   0.15   0.00
God       32.79   1.18   0.06   1.21
die       18.20   2.28   0.30   0.86
cold      9.19    0.69   0.00   0.60
chair     18.31   1.93   0.40   0.00




usually bring to mind the same concepts). As for the frequency in the language, the positive 
correlation indicates that widely used words are more consistent. Indeed, the widespread use 
of a word probably causes it to be used in a more homogeneous way. Finally, the ambiguity 
of the word correlated negatively with consistency. This result was also expected, since if a 
word is ambiguous then it can appear in multiple contexts, with its neighborhood tending to 
be more heterogeneous. 

In an attempt to understand why some words are more consistent than others, we examined
the neighborhood of words possessing the highest and the lowest consistency values. By way of
illustration, let us analyze the following strongly consistent words in C_D: sleep, silence, happy,
hair, God and cold. For each neighbor of each one of these words, we counted the number of
authors that associated them. The histograms with the frequencies in Figure 5 indicate that
these highly consistent words tend to have semantically related neighbors. For instance, all the
8 authors associated sleep with night, wink and dream. To confirm this hypothesis of semantic
proximity, we compared the neighbors for the texts written by the 8 authors with the neighbors
of the semantic network derived from the Edinburgh Associative Thesaurus (EAT) [54], which
connects semantically related concepts. For each of the selected words with high consistency
values, we computed the distance in the EAT network between the word and each of its
neighbors in the authors' networks.
Then, we calculated the ratio between the number of neighbors that are at a specific distance d
and the expected number of neighbors that are at the same distance, assuming that the nodes
were randomly chosen in the Edinburgh Associative Thesaurus network. The results in Table
7 show that the number of neighbors immediately related in the EAT network is much higher
than expected (a random choice would give a ratio of 1), thus confirming that consistency is related to
the limitation of semantic context.



3.3 Using the consistency index to recognize authorship 

The expected correlations with linguistic features confirm the suitability of the consistency 
indices to quantify semantic aspects in text. We now check whether these indices can be used 
to identify authorship, as authors may use words in more or less consistent ways. In the extreme 
case in which one compares authors of distinct genres (such as storytellers and authors writing 
scientific books), a very good distinction should be expected. 

Figure 5: Frequency of association of neighbors for the words sleep, silence, happy, hair, God 
and cold (one panel per word). Note that for the words shown in this figure (with high values 
of consistency), their neighborhood seems to be restricted to a single context. 

Figure 6: Comparison of the average consistency indices for classes Ca, Cb, Cc and Cd. The 
brighter the grayscale the smaller the p-value and the higher the difference in the average 
consistency. 

This hypothesis was investigated by defining for each author an average consistency as: 



$$ C = \frac{1}{n_T} \sum_{k=1}^{n_p} f_k \, C_k , \qquad (20) $$

where $n_p$ and $n_T$ represent respectively the number of distinct words and the number of tokens 
in the corpus of a given author, $C_k$ is the consistency index for the k-th word and $f_k$ is the 
frequency of the k-th word. Using $C_{hist}$ to compute $C$ for each author in eq. (20), we compared 
the average consistency of all 28 pairs of distinct authors. Figure 6 illustrates with grayscales 
the p-values from the comparison of C. While some pairs of authors are easily distinguishable 
(see e.g. Stoker and Darwin in Ca, Woolf and Hardy in Cb, Darwin and Doyle in Cc and 
Wodehouse and Dickens in Cd), others are quite similar (Stoker and Woolf in Cd). In addition, 
counting the number of darkish grayscales one notes that the words in Ca provide the best 
ability of distinction, while class Cd provides the worst. 
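
As an illustration, the average consistency of an author can be computed as below. This sketch assumes that eq. (20) weights the consistency index of each word by its frequency and normalizes by the number of tokens; the function name `average_consistency`, the toy values and the author labels are hypothetical. It also checks that 8 authors yield the 28 distinct pairs mentioned in the text.

```python
from itertools import combinations

def average_consistency(freq, consistency):
    """Frequency-weighted mean of per-word consistency (our reading of eq. 20).

    freq: dict mapping word -> number of occurrences in the author's corpus
    consistency: dict mapping word -> consistency index of that word
    """
    n_tokens = sum(freq.values())  # total number of tokens
    return sum(freq[w] * consistency[w] for w in freq) / n_tokens

# Hypothetical toy corpus with two distinct words and four tokens.
toy_freq = {"sleep": 3, "cold": 1}
toy_consistency = {"sleep": 0.8, "cold": 0.4}
toy_average = average_consistency(toy_freq, toy_consistency)

# Placeholder labels for the 8 authors; every unordered pair is compared once.
authors = ["author_%d" % i for i in range(1, 9)]
author_pairs = list(combinations(authors, 2))
```

The statistical test that produces the p-values in Figure 6 is not specified in this section, so it is deliberately left out of the sketch.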

Finally, to examine how the authors are clustered in terms of consistency, we used the hierarchical 
clustering algorithm UPGMA [HE] based on the cosine similarity to generate a hierarchy 
of authors. Figure 7 shows that four clusters can be identified upon choosing an appropriate 
distance threshold. Significantly, authors of novels (such as Wodehouse) are separated from 
those of scientific works (such as Darwin), which indicates that the difference in style may be 
reflected in the use of words of distinct consistency indices. 

Figure 7: Hierarchical grouping using the average consistency. Note that the texts by Darwin 
can be considered as outliers, since they are predominantly scientific, where words should be 
used in a more consistent way. 
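
The clustering step can be sketched in a few lines. The code below is a plain average-linkage (UPGMA) implementation over cosine distances (one minus the cosine similarity), written for illustration only; the function names, the labels and the three toy vectors are hypothetical. What the vector entries hold (word frequencies, as in the bag-of-words representation, or consistency values) is left open here.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def upgma(vectors):
    """Average-linkage agglomerative clustering.

    vectors: dict mapping label -> list of numbers.
    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of labels.
    """
    def avg_dist(c1, c2):
        # UPGMA: mean distance over all cross-cluster pairs of members.
        total = sum(cosine_distance(vectors[a], vectors[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, i, j = min((avg_dist(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical toy vectors: two similar "novel" profiles, one "science" profile.
toy_vectors = {
    "novel_1": [5, 1, 0],
    "novel_2": [4, 2, 0],
    "science_1": [0, 1, 6],
}
merges = upgma(toy_vectors)
```

Cutting the resulting hierarchy at a chosen distance threshold gives the flat clusters; in the toy data the two novel profiles merge first and the science profile joins last, at a much larger distance.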

In summary, consistency indices are useful for detecting authorship and can now be combined 
with other conventional methods [3U \57\ EB1 EH1 ED] to enhance accuracy rates in distinguishing 
authors. 



4 Conclusion and final remarks 

In this paper we have studied the problem of quantifying the complexity of writing considering 
the consistency of words. Assuming that consistent words induce grammatically/semantically 
limited contexts, we defined several indices to measure the tendency of a word to be used 
homogeneously (i.e., preserving the context). With the various indices proposed, we found that 
the consistency of words is well fitted by a log-normal distribution, in contrast to Zipf's 
law for ranking words according to frequency. Interestingly, we found that consistency can be 
seen as a multiplicative growth process: each new writer using the word modifies the current 
consistency by adding or removing a fraction of the current value, and this fraction is 
independent of the current consistency. Furthermore, more frequent and familiar words tend to 
be used more consistently, while ambiguous words and words which take a long time to be 
learned tend to be less consistent. Finally, we confirmed that the quantification of complexity 
in terms of the characterization of the consistency of words is able to distinguish authors, 
especially those with different genres of writing. 

3 Each author is characterized as a vector so that each element of the vector stores the frequency of the 
corresponding word in the author's book. This model is widely used in text mining research and is known as 
bag of words [55]. 
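
The link between a multiplicative growth process and the log-normal distribution can be illustrated with a small simulation. This is a sketch under assumptions of our own: the fraction added or removed by each writer is drawn uniformly from a hypothetical range, independently of the current value, and the function name `simulate_consistency` and all parameter values are illustrative. Because the logarithm of the final value is a sum of many independent terms, it is approximately normal, so the value itself is approximately log-normal.

```python
import math
import random
import statistics

def simulate_consistency(n_writers, initial=1.0, max_fraction=0.2, rng=random):
    """Apply n_writers multiplicative updates to an initial consistency value."""
    value = initial
    for _ in range(n_writers):
        # The fraction does not depend on the current value.
        fraction = rng.uniform(-max_fraction, max_fraction)
        value *= 1.0 + fraction  # add or remove a fraction of the current value
    return value

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [simulate_consistency(200, rng=rng) for _ in range(2000)]
logs = [math.log(v) for v in samples]

# The logs should look roughly symmetric (normal-like), while the raw
# values are right-skewed, with the mean above the median.
mean_log = statistics.mean(logs)
stdev_log = statistics.stdev(logs)
mean_value = statistics.mean(samples)
median_value = statistics.median(samples)
```

A histogram of `logs` would be expected to look bell-shaped, whereas a histogram of `samples` would show the long right tail typical of a log-normal distribution.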

As future work we propose the use of new metrics for consistency in order to ascertain 
whether the results are preserved. As a starting point, we intend to make use of the so-called 
Katz similarity [61] to quantify consistency using the distance between concepts. Similarly, 
we intend to further study the behavior of consistency by extending the measures developed here 
to consider not only the immediate neighborhood of the word, but also more distant 
neighborhoods. Since the consistency of words is related to semantics, we wish to combine it with 
features related to the writing style (for instance, using CN topological measurements (HI 60J )) 
to verify whether the distinguishability is enhanced. Finally, we suggest that the consistency indices 
may also be useful in applications to quantify subjectivity (in written texts or in transcribed 
speech), since subjective words might have lower consistency values because distinct people 
probably associate distinct neighbors with subjective concepts. 